Kutay Akalın
05/01/2020
The data covers 17,007 strategy games on the Apple App Store. It was collected on 3 August 2019 using the iTunes API and the App Store sitemap. For this analysis, the data was downloaded from the Kaggle website in CSV format and uploaded to a personal GitHub page.
I chose this dataset because I have a strong interest in the game app market. I want to analyse popular sub-genres, their user ratings and their user rating counts to gain insight into user preferences. In my opinion, a game developed with these preferences in mind is more likely to be successful.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as py
import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import io
import requests
import warnings
warnings.filterwarnings('ignore')
url="https://github.com/pjournal/mef03-KutayAkalin/blob/master/appstore_games.csv?raw=True"
s=requests.get(url).content
appstore_games=pd.read_csv(io.StringIO(s.decode('utf-8')))
display(appstore_games.info())
display(appstore_games.describe())
display(appstore_games.describe(include='O'))
We see that some columns contain null (NaN) values. We can count the missing values per column with:
appstore_games.isna().sum()
msno.matrix(appstore_games)
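The missing-value counts above are sometimes easier to read as a share of rows. The following is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for appstore_games; column names mirror the dataset,
# but the values are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Subtitle": [None, "fun", None, None],
    "Average User Rating": [4.5, np.nan, 3.0, np.nan],
})

# isna() marks missing cells; mean() over booleans gives the missing fraction.
missing_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

The same expression could be applied to `appstore_games` to rank columns by how incomplete they are.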
First, we need to fill or delete the empty values. If an app has fewer than five user ratings, both 'User Rating Count' and 'Average User Rating' are NaN. These rows therefore need to be removed before further analysis. In addition, I add a "Free_or_Paid" column.
#Dropping unnecessary columns
gameanalyse = appstore_games.copy()
gameanalyse = gameanalyse.drop(columns="URL")
gameanalyse = gameanalyse.drop(columns="Icon URL")
gameanalyse = gameanalyse.drop(columns="ID")
#Dropping User Rating NaN Values
gameanalyse = gameanalyse[pd.notnull(gameanalyse['User Rating Count'])]
len(gameanalyse)
# Converting date columns from object to datetime
gameanalyse['Original Release Date'] = pd.to_datetime(gameanalyse['Original Release Date'], format = '%d/%m/%Y')
gameanalyse['Current Version Release Date'] = pd.to_datetime(gameanalyse['Current Version Release Date'], format = '%d/%m/%Y')
#Adding Free - Paid Column
for i in gameanalyse.index:
    if gameanalyse.loc[i,'Price'] == 0.0:
        gameanalyse.loc[i,'Free_or_Paid'] = 'Free'
    else:
        gameanalyse.loc[i,'Free_or_Paid'] = 'Paid'
gameanalyse=gameanalyse.reset_index()
gameanalyse.head()
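The row-by-row loop that builds the "Free_or_Paid" column can also be expressed as a single vectorized operation. This is only a sketch on a made-up frame (`toy` and its prices are assumptions), using `np.where`:

```python
import numpy as np
import pandas as pd

# Toy frame; 'Price' mirrors the dataset column, values are invented.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Price": [0.0, 2.99, 0.0]})

# np.where picks 'Free' where the condition holds and 'Paid' elsewhere.
# As in the loop, a NaN price would compare unequal to 0.0 and be labelled 'Paid'.
toy["Free_or_Paid"] = np.where(toy["Price"] == 0.0, "Free", "Paid")
print(toy)
```

On a frame of this size the difference is negligible, but the vectorized form avoids one `.loc` assignment per row.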
For the genre analysis, I cleaned the "Genres" column and extracted the sub-genres. After that, I grouped them into main sub-genres, which are shown in the "Genre" column.
#Grouping Genres
game_genres=gameanalyse.copy()
game_genres=game_genres[game_genres['Primary Genre']=='Games']
game_genres['Genre'] = game_genres['Genres'].str.replace(',', '').str.replace('Games', '').str.replace('Entertainment', '').str.replace('Strategy', '')
game_genres['Genre'] = game_genres['Genre'].str.split(' ').map(lambda x: ' '.join(sorted(x)))
game_genres['Genre'] = game_genres['Genre'].str.strip()
#Rows with an empty genre (no sub-genre) are filled with 'General'
index = game_genres.index[game_genres['Genre']==""].tolist()
game_genres.loc[index,'Genre'] = 'General'
game_genres['Genre']
#After analysing the genre distributions, combined genres are assigned to their main genre.
game_genres.loc[game_genres['Genre'].str.contains('Puzzle'),'Genre'] = 'Puzzle'
game_genres.loc[game_genres['Genre'].str.contains('Simulation'),'Genre'] = 'Simulation'
game_genres.loc[game_genres['Genre'].str.contains('Action'),'Genre'] = 'Action'
game_genres.loc[game_genres['Genre'].str.contains('Board'),'Genre'] = 'Board'
game_genres.loc[np.logical_and(game_genres['Genre'].str.contains('Role'),game_genres['Genre'].str.contains('Playing')),'Genre'] = 'Role Playing'
game_genres.loc[game_genres['Genre'].str.contains('Casual'),'Genre'] = 'Casual'
game_genres.loc[game_genres['Genre'].str.contains('Card'),'Genre'] = 'Card'
game_genres.loc[game_genres['Genre'].str.contains('Adventure'),'Genre'] = 'Adventure'
game_genres.loc[game_genres['Genre'].str.contains('Sports'),'Genre'] = 'Sports'
game_genres.loc[game_genres['Genre'].str.contains('Family'),'Genre'] = 'Family'
game_genres.loc[game_genres['Genre'].str.contains('Education'),'Genre'] = 'Education'
game_genres.loc[game_genres['Genre'].str.contains('Word'),'Genre'] = 'Word'
game_genres.loc[game_genres['Genre'].str.contains('Music'),'Genre'] = 'Music'
game_genres.loc[game_genres['Genre'].str.contains('Trivia'),'Genre'] = 'Trivia'
#Re-indexing and selecting the necessary columns for further analysis.
game_genres=game_genres.reset_index()
game_genres=game_genres.drop(columns=['level_0','Primary Genre','Genres'])
game_genres.head()
len(game_genres['index'])
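To make the genre clean-up steps easier to follow, here is a small sketch that applies the same idea to three made-up "Genres" strings (the sample strings are assumptions; the real column may contain other combinations):

```python
import pandas as pd

# Invented examples in the same comma-separated style as the 'Genres' column.
genres = pd.Series([
    "Games, Strategy, Puzzle",
    "Games, Entertainment, Strategy",
    "Games, Strategy, Role Playing",
])

# Remove the separators and the labels shared by almost every row.
cleaned = genres
for word in [",", "Games", "Entertainment", "Strategy"]:
    cleaned = cleaned.str.replace(word, "", regex=False)

# Split on whitespace, sort the remaining tokens and join them again,
# so that the same combination always yields the same string.
cleaned = cleaned.str.split().map(lambda tokens: " ".join(sorted(tokens)))

# Rows left empty have no sub-genre.
cleaned = cleaned.replace("", "General")

# Sorting turns 'Role Playing' into 'Playing Role', so both words are checked.
is_rpg = cleaned.str.contains("Role") & cleaned.str.contains("Playing")
cleaned = cleaned.mask(is_rpg, "Role Playing")
print(cleaned.tolist())
```

The sorting step explains why the Role Playing rule in the notebook checks for both words separately instead of the literal phrase.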
Some of the popular sub-genres are displayed below:
popular_genres = game_genres.groupby("Genre").agg('count').sort_values(by=['Name'],ascending=False)
popular_genres = popular_genres[popular_genres['Name']>20].loc[:,'Name'].reset_index()
popular_genres.rename(columns={'Name':'Count'},inplace=True)
popular_genres
plt.figure(figsize=(15, 7))
plt1=sns.countplot(x="Average User Rating", data=gameanalyse,palette="rocket")
plt1.set_ylabel('Frequency', fontsize = 20)
plt1.set_xlabel('Average User Rating', fontsize = 20)
plt.show()
plt.figure(figsize=(15, 7))
plt1=sns.countplot(x="Age Rating", data=gameanalyse)
plt1.set_ylabel('Frequency', fontsize = 20)
plt1.set_xlabel('Age Rating', fontsize = 20)
plt.show()
from wordcloud import WordCloud
fig, ax = plt.subplots(1, 2, figsize=(16,16))
wordcloud = WordCloud(background_color='white',width=800, height=800).generate(' '.join(gameanalyse['Name']))
wordcloud_sub = WordCloud(background_color='white',width=800, height=800).generate(' '.join(gameanalyse['Subtitle'].dropna().astype(str)) )
ax[0].imshow(wordcloud)
ax[0].axis('off')
ax[0].set_title('Wordcloud(Name)')
ax[1].imshow(wordcloud_sub)
ax[1].axis('off')
ax[1].set_title('Wordcloud(Subtitle)')
plt.show()
x = gameanalyse['Price']
x =x[np.logical_not(np.isnan(x))]
plt.figure(figsize=(16, 8))
sns.kdeplot(x,shade = True, linewidth = 5)
gameanalyse['Price'].describe()
x = appstore_games['User Rating Count']
x = x[np.logical_not(np.isnan(x))]
plt.figure(figsize=(16, 8))
sns.kdeplot(x,shade = True, linewidth = 5)
gameanalyse['User Rating Count'].describe()
fig = px.box(gameanalyse, y="Size")
fig.update_layout(
title={
'text': "Size Box Plot",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
gameanalyse['Size'].describe()
As we can see in the box plot, the size distribution contains many outliers. We can get better insight by removing them.
size_analyse = gameanalyse[pd.notnull(gameanalyse['Size'])]
# Quartiles are computed after dropping NaN sizes; np.percentile returns NaN if any value is NaN
q75, q25 = np.percentile(size_analyse['Size'], [75 ,25])
iqr = q75 - q25
upper = q75 + 1.5*iqr
lower = q25 - 1.5*iqr
size_analyse=size_analyse[np.logical_and(size_analyse['Size']<upper,size_analyse['Size']>lower)]
size_analyse
plt.figure(figsize=(16, 8))
sns.kdeplot(size_analyse['Size'],shade = True,)
size_analyse['Size'].describe()
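The outlier rule used here is the usual 1.5 × IQR fence. As a rough worked example on invented numbers (the `sizes` array is an assumption, not taken from the dataset), the bounds can be computed like this:

```python
import numpy as np

# Invented sizes with one obvious outlier and one missing value.
sizes = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 400.0, np.nan])

# nanpercentile ignores NaN when computing the quartiles.
q25, q75 = np.nanpercentile(sizes, [25, 75])
iqr = q75 - q25
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# Comparisons with NaN are False, so the missing value is dropped as well.
kept = sizes[(sizes > lower) & (sizes < upper)]
print(lower, upper, kept)
```

Values outside the two fences are treated as outliers; whether they should be dropped or only flagged depends on the question being asked.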
ax = sns.FacetGrid(gameanalyse, col="Age Rating", col_wrap=2, height=6, aspect=2, sharey=False)
ax.map(sns.countplot, 'Average User Rating', alpha = 0.7, linewidth=4, edgecolor= 'black')
plt.subplots_adjust(hspace=0.45)
plt.show()
fig = px.bar(popular_genres, x='Genre', y='Count')
fig.update_layout(
title={
'text': "Bar Graph of Genres",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
For the pairplot analysis, the five most popular genres are selected.
sns.set(style="ticks")
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]
game_genres_ds = game_genres.loc[:,["Average User Rating","User Rating Count","Price","Age Rating","Size","Free_or_Paid","Genre"]]
game_genres_ds = game_genres_ds[game_genres_ds.Genre.isin(popular_genres)]
plt.figure(figsize=(16, 8))
sns.set(style="ticks",color_codes=True)
sns.pairplot(game_genres_ds, hue="Genre")
#Popular Genre Ratings
ax = sns.FacetGrid(game_genres_ds, col="Genre", col_wrap=2, height=6, aspect=2, sharey=False)
ax.map(sns.countplot, 'Average User Rating', alpha = 0.7, linewidth=4, edgecolor= 'black')
plt.subplots_adjust(hspace=0.45)
plt.show()
fig = px.box(game_genres_ds,x="Genre", y="Size")
fig.update_layout(
title={
'text': "Genre and Size Box Plot",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
temp_df = gameanalyse.groupby(['Original Release Date']).Size.sum().reset_index()
fig = px.line(temp_df, x='Original Release Date', y='Size')
fig.show()
Correlation heatmap between Average User Rating, Price, User Rating Count and Size columns:
game_genres_cor = gameanalyse.copy()
game_genres_cor = game_genres_cor.drop(columns=["index"])
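# Only numeric columns are kept, since Pearson correlation is undefined for text columns
game_genres_cor = game_genres_cor.select_dtypes(include='number').corr(method='pearson')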
plt.figure(figsize=(14, 14))
ax = sns.heatmap(
game_genres_cor,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(20, 220, n=200),
square=True, annot = True
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation=45,
horizontalalignment='right'
)
ax.set_ylim(len(game_genres_cor)+0.5, -0.5);
Null Hypothesis:
H0: Average User Rating and Genre types are independent.
Alternate Hypothesis:
H1: Average User Rating and Genre types are not independent.
import scipy.stats
table1 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Genre"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table1)
table1
msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Reject H0 (p_value = 0.000)
We have strong statistical evidence that Average User Rating is not independent of Genre (p_value~0.000).
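For readers less familiar with the chi-square test of independence, the following sketch runs the same SciPy call on a small made-up table (the `toy` data and the "high"/"low" rating labels are assumptions for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: 40 puzzle games and 40 board games with a coarse rating label.
toy = pd.DataFrame({
    "Rating": ["high"] * 30 + ["low"] * 10 + ["high"] * 10 + ["low"] * 30,
    "Genre": ["Puzzle"] * 40 + ["Board"] * 40,
})

# Contingency table of observed counts, as in pd.crosstab above.
table = pd.crosstab(index=toy["Rating"], columns=toy["Genre"])

# Returns the statistic, the p-value, the degrees of freedom and the
# expected counts under independence.
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print("chi2 =", chi2, "p =", p, "dof =", dof)
print("smallest expected count:", expected.min())
```

A commonly cited rule of thumb is that the chi-square approximation becomes less reliable when expected counts fall below about 5, so it can be worth checking `expected.min()` for sparse tables.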
Null Hypothesis:
H0: Average User Rating and Price are independent.
Alternate Hypothesis:
H1: Average User Rating and Price are not independent.
table2 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Price"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table2)
msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Cannot reject H0 (p_value = 0.755).
There is not enough statistical evidence to conclude that Average User Rating and Price are dependent (p_value = 0.755).
Null Hypothesis:
H0: Average User Rating and Age Rating are independent.
Alternate Hypothesis:
H1: Average User Rating and Age Rating are not independent.
table2 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Age Rating"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table2)
msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Reject H0 (p_value~0.000).
There is strong statistical evidence that Average User Rating and Age Rating are not independent (p_value~0.000).
Average user rating test between board and puzzle games:
Null Hypothesis:
H0: The average user rating of puzzle games is lower than or equal to that of board games.
Alternate Hypothesis:
H1: The average user rating of puzzle games is higher than that of board games.
from statsmodels.stats.weightstats import ztest
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]
Puzzle = game_genres[game_genres['Genre'] == 'Puzzle']['Average User Rating']
Board = game_genres[game_genres['Genre'] == 'Board']['Average User Rating']
ztest(Puzzle,Board,alternative = 'larger')
There is strong statistical evidence that the average user rating of puzzle games is higher than that of board games (p-value~0.000).
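To show what the one-sided test computes, here is a minimal hand-rolled sketch using only the standard library. The helper `one_sided_ztest` and the two rating lists are hypothetical, and the pooled variance reflects my understanding of the `ztest` default (`usevar='pooled'`), so treat it as an approximation of that call rather than a replacement:

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def one_sided_ztest(a, b):
    """Return (z, p) for H1: mean(a) > mean(b), using a pooled variance."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance of the two groups.
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    se = sqrt(pooled * (1 / n_a + 1 / n_b))
    z = (mean(a) - mean(b)) / se
    # Upper-tail probability under the standard normal distribution.
    p = 1 - NormalDist().cdf(z)
    return z, p

# Invented ratings for two groups.
group_a = [4.5, 4.0, 4.5, 5.0, 4.5, 4.0]
group_b = [3.5, 4.0, 3.0, 3.5, 4.0, 3.5]
z, p = one_sided_ztest(group_a, group_b)
print(z, p)
```

For the "smaller" alternative used later in the notebook, the p-value would be the lower tail, `NormalDist().cdf(z)`, instead.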
App size test between board and puzzle games:
Null Hypothesis:
H0: The size of puzzle games is greater than or equal to that of board games.
Alternate Hypothesis:
H1: The size of puzzle games is smaller than that of board games.
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]
Puzzle = game_genres[game_genres['Genre'] == 'Puzzle']['Size']
Board = game_genres[game_genres['Genre'] == 'Board']['Size']
ztest(Puzzle,Board,alternative = 'smaller')
There is strong statistical evidence that the size of puzzle games is smaller than that of board games (p-value~0.000).
Average user rating test between 9+ and 17+ games:
Null Hypothesis:
H0: The average user rating of 17+ games is higher than or equal to that of 9+ games.
Alternate Hypothesis:
H1: The average user rating of 17+ games is lower than that of 9+ games.
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]
# Variables are named after the age groups being compared
rating_17 = game_genres[game_genres['Age Rating'] == '17+']['Average User Rating']
rating_9 = game_genres[game_genres['Age Rating'] == '9+']['Average User Rating']
ztest(rating_17,rating_9,alternative = 'smaller')
There is strong statistical evidence that the average user rating of 17+ games is lower than that of 9+ games (p-value=0.005).
plt.figure(figsize=(18,10), dpi= 100)
ax = sns.regplot(data=game_genres_ds, x='Size', y='Average User Rating', color = 'darkred')
ax.set_ylabel('Average User Rating', fontsize = 20)
ax.set_xlabel('Size', fontsize = 20)
plt.show()
As we see in the graph, there is a slight positive relationship between size and average user rating.
We know that the correlations between average user rating and the numeric variables are very low, and that the rating depends on categorical variables such as genre and age rating. Nevertheless, I tried a regression analysis with average user rating as the dependent variable.
For this purpose, the following model was built for testing:
y(average rating) = B0 + B1*(user rating count) + B2*(price) + B3*(size)
import statsmodels.api as sm
X = gameanalyse[['User Rating Count','Price','Size']] # three explanatory variables for the multiple regression; further variables can be added within the brackets
Y = gameanalyse['Average User Rating']
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
As we see in the result, the adjusted R-squared value is too low to validate this model. Thus, the relationship cannot be explained by a linear model with these variables.
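As a reminder of what the adjusted R-squared in the summary measures, the sketch below fits an ordinary least squares model on synthetic data with plain NumPy and computes both values by hand. The data, the seed and the coefficients are invented for illustration and are unrelated to the App Store dataset:

```python
import numpy as np

# Synthetic data: three predictors that carry almost no signal about y.
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
y = 4.0 + 0.05 * X[:, 0] + rng.normal(size=n)

# Add the intercept column (the role of sm.add_constant) and solve least squares.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# R-squared: share of the variance in y explained by the fitted values.
resid = y - X1 @ beta
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Adjusted R-squared penalises the number of predictors k.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)
```

When the predictors explain little, both numbers stay close to zero, which is the pattern seen in the model summary.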
gameanalyse.sort_values(by=['Average User Rating', 'User Rating Count'], ascending=False)[['Name', 'Average User Rating', 'User Rating Count', 'Size', 'Price', 'Developer','Genres']].head(10)
gameanalyse.sort_values(by=['User Rating Count'], ascending=False)[['Name', 'Average User Rating', 'User Rating Count', 'Size','Price', 'Developer','Genres']].head(10)
During the inferential analysis, we found that: